1. INTRODUCTION

In this paper we present a simplified model of grammar identification, which tries to capture
some of the main features of the process by which a child acquires a language. Formally
this amounts to solving the following statistical question: in a subshift of finite type, how
can one infer the incidence matrix, given a finite sample chosen according to a Gibbs measure
whose potential is known?

In this introduction we will sketch, in a very simplified way, how the problem is 
formulated from the point of view of Linguistics. In the next section we present the 
mathematical model and state the theorems. Readers who are only interested in the 
mathematical aspects of the problem should go directly to the next section. 

In Chomsky's Principles and Parameters framework, the problem of understanding
language acquisition can be roughly formulated in the following terms. A child has a genetically
inherited linguistic capacity which makes him able to learn a language. This linguistic
capacity is characterized by a finite set of constraints which distinguish natural languages
among all the possible formal languages. This set of constraints is what Chomsky calls the
Universal Grammar. Any particular solution of these constraints is called a grammar and
defines in a precise way a natural language. Therefore, "learning a language" is nothing
but identifying an element in the set of natural grammars. We refer the reader to Chomsky
1986 for a comprehensive introduction to the Principles and Parameters Model.

To identify the parental grammar a learning child is guided by the linguistic information
available in his environment. Psycholinguists agree that corrections of wrong constructions
do not play an important role in the learning process (cf. McNeill 1966). Therefore
the model must use only positive evidence as a basis of inference.

The idea that the parental prosody helps the learning child to achieve his identification
task has appeared recently in the linguistics literature. Informally speaking, the prosody of
a language is its characteristic music, which contains, among other things, its typical
stress and intonational patterns. Phonologists commonly accept the assumption that the
prosody of a language depends on its syntax, even if a learning child acquires prosody
before fixing his grammar. Therefore it is natural to suppose that, once acquired, prosody
provides the learning child with hints about the parental grammar. This is the point of
view we adopt here. We refer the reader to Galves and Galves 1993, where this point of
view is applied to a concrete linguistic situation.

An identification model must take into consideration the fact that languages change.
Following Lightfoot 1979, grammatical changes occur during the acquisition process. From
time to time a generation of learning children chooses a grammar which is different from the
parental one. It has been argued that some of those changes may have been induced by a
former prosodic change. Therefore the model must account both for the robustness of the
acquisition process, and for the possibility of misidentification driven by some particular
prosodic choices.

The learning process can be naturally considered as a random process. The sequence
of sentences the learning child receives from his parents does not follow any deterministic
order (parents do not follow any kind of "manual" to teach a language to their child).
This random process is stationary in time and its law depends on the parental syntax
and prosody. Therefore the basis of a reasonable model of language acquisition must be
a probability measure having the language as its sample space and having the syntax
and the prosody as parameters. The Thermodynamical Formalism provides a natural
way to express this. We refer the reader to Ruelle 1978 for a general presentation of the
Thermodynamical Formalism.

The issue we consider here was first addressed in a rigorous mathematical way in Gold
1967 through the identification in the limit model. This was not a probabilistic model and
did not take prosody into consideration as an element playing a role in the identification
process. This model ended in an admission of failure: the identification in the limit procedure
never converges to a unique grammar. To overcome this failure, it has been suggested in
the linguistic literature (Berwick 1985, following Angluin 1980) that an extra principle
should be taken into consideration, the so-called Subset Principle. However, a probabilistic
point of view like the one we adopt here solves the problem in a more natural way.

In the present paper we restrict our study to what are called regular grammars in
Chomsky's hierarchy (cf. Chomsky 1963). Since Chomsky 1956, it is well known that regular
grammars are too rough a concept to capture the subtle properties of natural languages.
However we do believe that our mathematical results express in a simplified way part of
the real story.

2. DEFINITIONS AND RESULTS 

A lexicon is a finite set $A$. A grammar $G$ acting on the lexicon $A$ is a matrix indexed by $A$
with entries equal to 0 or 1. We will only consider irreducible and aperiodic matrices,
i.e. matrices for which there is an integer $k$ such that all the entries of $G^k$ are nonzero. These
matrices are also called primitive in the literature (see Horn and Johnson 1985). We will
denote by $\mathcal{G}$ the set of all such grammars.
The language generated by G is the set 

$$L(G) = \{(x_0, x_1, \ldots) : x_j \in A,\ G_{x_j, x_{j+1}} = 1,\ j \ge 0\} .$$

We introduce a partial order in $\mathcal{G}$ in the following way. If $G$ and $G'$ belong to $\mathcal{G}$ we
say that $G < G'$ if for all pairs $(x,y) \in A^2$ we have $G(x,y) \le G'(x,y)$ and the inequality
is strict for at least one pair.

Note that $G < G'$ is equivalent to $L(G) \subsetneq L(G')$.
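
As an illustration, the two definitions above can be checked mechanically on small examples. The following Python sketch is not part of the original construction: the function names `is_primitive` and `strictly_smaller` and the example matrices are hypothetical choices made here, and the search for the exponent $k$ is cut off using Wielandt's classical bound $k \le (|A|-1)^2 + 1$ for primitive matrices.

```python
def mat_mul_bool(a, b):
    """Boolean product of two square 0/1 matrices (entries stay 0 or 1)."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def is_primitive(g):
    """True if some power G^k of the 0/1 matrix g has all entries nonzero."""
    n = len(g)
    power = [row[:] for row in g]
    # Wielandt's bound: if g is primitive, G^k > 0 for some k <= (n - 1)^2 + 1.
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = mat_mul_bool(power, g)
    return False


def strictly_smaller(g, h):
    """True if g < h: g(x, y) <= h(x, y) for all pairs, strictly for at least one."""
    pairs = [(a, b) for row_g, row_h in zip(g, h) for a, b in zip(row_g, row_h)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


# Hypothetical example on the lexicon {0, 1}.
golden = [[1, 1], [1, 0]]   # forbids the transition 1 -> 1
full = [[1, 1], [1, 1]]     # allows every transition
print(is_primitive(golden), is_primitive(full), strictly_smaller(golden, full))
```

In this example both matrices are primitive and `golden` is strictly smaller than `full`, which corresponds to the strict inclusion of the language of `golden` in that of `full`.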

Let the sampler $S_n$ be the map from $A^{\mathbb{N}}$ to $A^n$ which gives the first $n$ symbols of an
infinite string.

We are interested in the problem of identifying a grammar in $\mathcal{G}$ given a sample produced
by $S_n$ acting on the language defined by a fixed but unknown grammar. It is natural
to consider $S_n$ as a random variable. In order to make this precise, for every $G \in \mathcal{G}$ we
introduce a probability measure on $L(G)$ equipped with the usual $\sigma$-field induced by the
product $\sigma$-algebra.

Since the grammar $G$ is the unknown in our problem, we need a canonical construction
of the probability measure. A natural way of doing this is to fix a real valued Hölder
continuous function $\phi$ on $A^{\mathbb{N}}$, and to associate to any grammar $G \in \mathcal{G}$ the Gibbs state
with potential $\phi$. We will denote this Gibbs measure by $\mu^G_\phi$. The classical references on
Gibbs measures are Bowen 1975 and Ruelle 1978. An extensive and up-to-date reference
is Parry and Pollicott 1990. In particular the reader will find there a proof of the existence
and uniqueness of the measure $\mu^G_\phi$ for $\phi$ in $C^\alpha$, the Banach space of $\alpha$-Hölder continuous
functions, equipped with the usual $C^\alpha$ norm, for any fixed $\alpha$.



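We recall that $\mu^G_\phi$ is the unique measure for which there is a constant $C > 1$
such that for any element $\underline{x}$ of $L(G)$ and for any integer $n$ we have

$$C^{-1} \le \frac{\mu^G_\phi(\{\underline{y} : S_n(\underline{y}) = S_n(\underline{x})\})}{\exp\left(-nP + \sum_{j=0}^{n-1} \phi(x_j, x_{j+1}, \ldots)\right)} \le C , \qquad (1)$$

where $P = P(\phi, G)$ is the pressure associated to the potential $\phi$ on $L(G)$ (see Theorem 1.2
in Bowen 1975).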

From now on we will use the shorthand notation $[S_n(\underline{x})]$ to denote the cylindrical set
$\{\underline{y} : S_n(\underline{y}) = (x_0, \ldots, x_{n-1})\}$.

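For a fixed $\phi$, and for any string $\underline{x}$, we define the sequence of Maximum Likelihood
subsets $M^\phi_n(\underline{x})$ ($n \ge 1$) of $\mathcal{G}$ by

$$M^\phi_n(\underline{x}) = \{G \in \mathcal{G} : \mu^G_\phi([S_n(\underline{x})]) \text{ is maximal}\} .$$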


We can now define the Maximum Likelihood Identification Procedure: for any
$n$, given $\phi$ and the sample $S_n(\underline{x})$ the learner chooses a grammar belonging to $M^\phi_n(\underline{x})$. This
procedure is non ambiguous if $M^\phi_n(\underline{x})$ is a singleton.
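
To make the procedure concrete, here is a minimal Python sketch restricted to the null potential and to a finite list of candidate grammars (both restrictions are assumptions of this illustration, not of the text). For the null potential the Gibbs state of a primitive matrix is the Parry measure, whose cylinder measure is $u_{x_0} v_{x_{n-1}} / \lambda^{n-1}$, where $\lambda$ is the largest eigenvalue and $u$, $v$ are the left and right eigenvectors normalized by $\sum_i u_i v_i = 1$. The names `perron`, `cylinder_measure` and `ml_set` are hypothetical.

```python
def perron(g, iters=500):
    """Power iteration for a primitive 0/1 matrix: (largest eigenvalue, right eigenvector)."""
    n = len(g)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)                 # v sums to 1, so sum(G v) estimates the eigenvalue
        v = [c / lam for c in w]     # renormalize so that v sums to 1 again
    return lam, v


def cylinder_measure(g, word):
    """Parry (null potential) measure of the cylinder defined by the finite word."""
    if any(g[a][b] == 0 for a, b in zip(word, word[1:])):
        return 0.0                   # the word uses a transition forbidden by g
    n = len(g)
    lam, v = perron(g)               # right eigenvector: G v = lam v
    transpose = [[g[j][i] for j in range(n)] for i in range(n)]
    _, u = perron(transpose)         # left eigenvector: u G = lam u
    norm = sum(u[i] * v[i] for i in range(n))
    return u[word[0]] * v[word[-1]] / (norm * lam ** (len(word) - 1))


def ml_set(candidates, word):
    """Candidates whose cylinder measure of the sample is maximal (up to rounding)."""
    scores = [cylinder_measure(g, word) for g in candidates]
    best = max(scores)
    return [g for g, s in zip(candidates, scores) if best > 0 and s >= best * (1 - 1e-9)]


# Hypothetical example on the lexicon {0, 1}.
golden = [[1, 1], [1, 0]]            # forbids the transition 1 -> 1
full = [[1, 1], [1, 1]]              # allows every transition
sample = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(ml_set([golden, full], sample))
```

On this sample, which never uses the transition 1 -> 1, the smaller grammar `golden` gives the larger cylinder measure and is the only element returned; a sample containing two consecutive 1's has measure zero under `golden`, so that `full` is returned instead.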

Our first identification Theorem says that the Maximum Likelihood Procedure always 
identifies the departure grammar in the limit as n diverges. 

Theorem A. For any potential $\phi$ and any grammar $G$ the Maximum Likelihood sets $M^\phi_n(\underline{x})$
converge to $\{G\}$ for $\mu^G_\phi$ almost all choices of $\underline{x}$, as $n$ diverges.

The above Theorem accounts for the robustness of the learning process. A child who
uses the Maximum Likelihood Procedure to identify the parental grammar succeeds using
a finite sample of positive evidence.

Nevertheless, languages change. Since in a natural language acquisition situation the
identification is done with a fixed $n$ which is biologically defined, it seems reasonable to
think of a model in which the Identification Procedure is based on a large but finite sample.
In this case, the Maximum Likelihood Procedure can give an unambiguous answer which
is nevertheless different from the departure grammar. This is summarized in the following
Proposition, which is trivial and will not be proved.

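Proposition B. For any $n \ge 1$, $G$ and $G'$ in $\mathcal{G}$ such that $G > G'$, and any $\epsilon > 0$ there
exists a Hölder continuous potential $\phi$ such that

$$\mu^G_\phi(\{\underline{x} : G' \in M^\phi_n(\underline{x}) \text{ but } G \notin M^\phi_n(\underline{x})\}) > 1 - \epsilon .$$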



This model is not satisfactory, since it only describes changes leading to smaller grammars
(i.e. grammars which allow fewer transitions than the parental one).

Trying to improve the model, we introduce a new procedure which coincides with the
Maximum Likelihood approach in most cases but nevertheless, even in the limit of diverging
$n$, can lead to a new grammar, which can be strictly greater than the original one. The
next two Theorems show how this may occur under a Minimum Entropy Identification
Procedure.

Given the Gibbs state $\mu^G_\phi$, let $h(\mu^G_\phi)$ denote its Kolmogorov-Sinai entropy (see Bowen
1975 for the definition).

The Shannon-McMillan-Breiman Theorem says that the $\mu^G_\phi$ measure of a cylindrical
set $[S_n(\underline{x})]$ is typically of order $e^{-n h(\mu^G_\phi)}$. This suggests that a Minimum Entropy Criterion
could be used instead of the Maximum Likelihood Procedure we have just described. As a
matter of fact Theorem C below shows that both approaches coincide for potentials which
are close to the null potential, i.e. potentials belonging to

$$O_r = \{\phi : \|\phi\|_{C^\alpha} < r\} ,$$

the $C^\alpha$ ball with radius $r$ centered at the null potential, where $r$ is sufficiently small.
We define the Minimum Entropy Subset $\mathcal{E}^\phi_n(\underline{x})$ by

$$\mathcal{E}^\phi_n(\underline{x}) = \{G : [S_n(\underline{x})] \subset L(G) \text{ and } h(\mu^G_\phi) \text{ is minimal}\} .$$

We may now introduce the Minimum Entropy Identification Procedure. Given
$\phi$, $\underline{x}$, and $n$ the learner chooses a grammar belonging to $\mathcal{E}^\phi_n(\underline{x})$.
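
The following Python sketch illustrates this procedure in the simplest situation, under assumptions that are ours and not the text's: the potential is null, so that the entropy of the Gibbs state is the topological entropy (the logarithm of the largest eigenvalue of the matrix); the search is restricted to a finite list of candidates; and the condition on the sample is read as "every transition of the sample is allowed by the grammar". The names `spectral_radius`, `topological_entropy`, `allows` and `min_entropy_set` are hypothetical.

```python
import math


def spectral_radius(g, iters=500):
    """Largest eigenvalue of a primitive 0/1 matrix, by power iteration."""
    n = len(g)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)                 # v sums to 1, so sum(G v) estimates the eigenvalue
        v = [c / lam for c in w]
    return lam


def topological_entropy(g):
    """Entropy of the Gibbs state of g for the null potential."""
    return math.log(spectral_radius(g))


def allows(g, word):
    """True if every transition of the finite sample is allowed by g."""
    return all(g[a][b] == 1 for a, b in zip(word, word[1:]))


def min_entropy_set(candidates, word):
    """Candidates compatible with the sample whose entropy is minimal (up to rounding)."""
    admissible = [g for g in candidates if allows(g, word)]
    if not admissible:
        return []
    entropies = [topological_entropy(g) for g in admissible]
    lowest = min(entropies)
    return [g for g, h in zip(admissible, entropies) if h <= lowest + 1e-9]


# Hypothetical example on the lexicon {0, 1}.
golden = [[1, 1], [1, 0]]            # forbids 1 -> 1
no_00 = [[0, 1], [1, 1]]             # forbids 0 -> 0, same entropy as golden
full = [[1, 1], [1, 1]]              # allows every transition
print(min_entropy_set([golden, no_00, full], [0, 1, 0, 0, 1, 0]))
```

In this example the sample contains the transition 0 -> 0, which rules out `no_00`, and `golden` is selected because its entropy is smaller than that of `full`. On the shorter sample `[0, 1, 0, 1]` both `golden` and `no_00` are returned, which shows how the procedure can remain ambiguous for small $n$.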

Theorem C. There exists a positive real number $r$ such that for any potential $\phi$ in $O_r$
and any grammar $G$ the Minimum Entropy sets $\mathcal{E}^\phi_n(\underline{x})$ converge to $\{G\}$ for $\mu^G_\phi$ almost all
choices of $\underline{x}$, as $n$ diverges.

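Theorem D. For any $G$ and $G'$ in $\mathcal{G}$, such that $G < G'$, there exists a Hölder continuous
potential $\phi$ such that for $\mu^G_\phi$ almost every $\underline{x}$ and for any $n$ large enough $G' \in \mathcal{E}^\phi_n(\underline{x})$ but
$G \notin \mathcal{E}^\phi_n(\underline{x})$.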

3. PROOF OF THEOREM A.

Theorem A will be proved as soon as we show that for any potential $\phi$ and $n$ large enough
the Maximum Likelihood set excludes both grammars which have an entry smaller than
the original grammar, and grammars which are strictly larger. This is done in the next
two lemmata.

Lemma 1. Let $G$ and $G'$ be two grammars such that $G' < G$. Then

$$\lim_{n \to \infty} \mu^G_\phi(\{\underline{x} : G' \in M^\phi_n(\underline{x})\}) = 0 .$$



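Proof. This follows directly from the Ergodic Theorem, which says that every event which
has positive probability does indeed occur. Since $G' < G$, some transition allowed by $G$ is
forbidden by $G'$; it has positive $\mu^G_\phi$ probability and therefore appears in the sample for $n$
large enough, at which point the cylinder of the sample has measure zero under the Gibbs state of $G'$.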

Lemma 2. For any $\phi \in C^\alpha$ there is an integer $n(\phi)$ such that for any pair $G$ and $G'$ of
grammars such that $G < G'$, and for any string $\underline{x} \in L(G)$, we have

$$\mu^{G'}_\phi([S_n(\underline{x})]) < \mu^G_\phi([S_n(\underline{x})])$$

for all $n > n(\phi)$.

Proof. From inequality (1) and the finiteness of $\mathcal{G}$, it follows that there is a constant $C$
(which depends only on $\phi$) such that for any integer $n$ and any string $\underline{x} \in L(G)$ we have

$$\mu^G_\phi([S_n(\underline{x})]) \ge C^{-1} e^{-n(P(\phi,G) - P(\phi,G'))} \mu^{G'}_\phi([S_n(\underline{x})]) .$$

To conclude the proof it is enough to use the following proposition. 
Proposition 3. Let $\phi \in C^\alpha(A^{\mathbb{N}})$; then if $G < G'$, we have

$$P(\phi,G) < P(\phi,G') .$$

Note in particular that in the above proposition the inequality is strict.
Proof. We recall (see Bowen 1975) that the pressure of $\phi$ is the logarithm of the largest
eigenvalue of the transfer operator defined on $C^\beta(L(G))$ ($0 < \beta < \alpha < 1$) by

$$\mathcal{L}^G_\phi \psi(x_0, x_1, \ldots) = \sum_{x \in A,\ G(x,x_0)=1} e^{\phi(x, x_0, x_1, \ldots)} \psi(x, x_0, x_1, \ldots) .$$

In what follows, if there is no danger of confusion, we shall use the shorter notation
$\mathcal{L}_G$ instead of $\mathcal{L}^G_\phi$.

Note that if $G < G'$ and $\psi$ is non-negative then $\mathcal{L}_G \psi \le \mathcal{L}_{G'} \psi$.

We recall also that in the Banach space $C^\beta(L(G))$, the operator $\mathcal{L}_G$ has a simple
isolated eigenvalue denoted below by $\lambda(G)$ which is real and positive; the rest of the
spectrum is contained in a disk centered at the origin with radius strictly smaller than this
number $\lambda(G)$. It is known that $\log(\lambda(G)) = P(\phi,G)$.

Moreover, the associated eigenvector $\psi_G$ is positive and bounded below away from
zero, and the associated eigencovector $\alpha_G$ is a positive measure. Similar results hold of
course for $G'$, and we are going to prove that $\lambda(G) < \lambda(G')$.

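This will follow from the eigenvalue equation for $G'$ considered on $L(G)$. We have
indeed for $\underline{x} \in L(G)$ the relation

$$\lambda(G') \psi_{G'}(\underline{x}) = \mathcal{L}_{G'} \psi_{G'}(\underline{x}) = \mathcal{L}_G \psi_{G'}(\underline{x}) + r_{G'}(\underline{x}) ,$$

where

$$r_{G'}(\underline{x}) = \sum_{x \in A,\ G(x,x_0)=0,\ G'(x,x_0)=1} e^{\phi(x, x_0, \ldots)} \psi_{G'}(x, x_0, x_1, \ldots) .$$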

Note however that for some choice of $x_0$, $r_{G'}(\underline{x})$ may be equal to zero. In order to deal with
this problem we iterate the above equality $n$ times where $n$ is the smallest integer such
that all entries of the matrix $G^n$ are non zero (this number is finite since $G$ is irreducible
and aperiodic). We obtain, since $r_{G'} \ge 0$ and $\mathcal{L}_G$ is positivity preserving,

$$\lambda(G')^n \psi_{G'}(\underline{x}) \ge \mathcal{L}_G^n \psi_{G'}(\underline{x}) + \mathcal{L}_G^{n-1} r_{G'}(\underline{x}) , \qquad (2)$$

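where

$$\mathcal{L}_G^{n-1} r_{G'}(\underline{x}) = \sum_{\substack{x_{-n-1}, \ldots, x_{-1} \in A \\ G'(x_{-n-1},x_{-n}) = G(x_{-n},x_{-n+1}) = \cdots = G(x_{-1},x_0) = 1}} \ \prod_{j=0}^{n} e^{\phi(x_{-n-1+j}, x_{-n+j}, \ldots)} \, \psi_{G'}(x_{-n-1}, x_{-n}, \ldots) .$$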

From our choice of $n$, and the aperiodicity of $G$, we conclude that for any $\underline{x} \in L(G)$, there
is at least one term in the above sum which is non zero. Moreover, since the function $\psi_{G'}$
is bounded below away from zero (and bounded above since it is continuous), we derive
that there is a number $\eta > 0$ such that

$$\mathcal{L}_G^{n-1} r_{G'}(\underline{x}) \ge \eta \, \psi_{G'}(\underline{x}) ,$$

which implies by (2) that

$$(\lambda(G')^n - \eta) \, \psi_{G'} \ge \mathcal{L}_G^n \psi_{G'}$$

on $L(G)$. Since the eigencovector $\alpha_G$ is a positive measure, and since $\psi_{G'}$ is strictly positive,
we have $\alpha_G(\psi_{G'}) > 0$. Therefore if we apply $\alpha_G$ to the two members of the above inequality,
we get

$$\lambda(G')^n - \eta \ge \lambda(G)^n .$$



This implies $P(\phi,G') = \log \lambda(G') > \log \lambda(G) = P(\phi,G)$.

4. PROOF OF THEOREM C

Let $\mu^G_0$ be the Gibbs state associated to the null potential. We will first prove that the
Minimum Entropy set converges to the original grammar with probability 1 with respect to
$\mu^G_0$. Theorem C will then follow by continuity of the pressure as a function of the potential
(cf. Parry and Pollicott 1990), Theorem A and Lemma 7.

We recall that for a matrix $G \in \mathcal{G}$ the eigenvalue with largest modulus is simple and
positive. This eigenvalue is the exponential of the topological entropy $h_{top}(G)$
of the Markov shift, and the associated eigenvector is a vector with strictly positive entries
(see Horn and Johnson 1985). In the notation of the previous section this corresponds to
the null potential $\phi = 0$ (see Bowen 1975).

We now observe that if $G \le G'$ entrywise, there is a matrix $R$ with entries equal to 0 or 1 such
that $G' = G + R$. However, if $h_{top}(G) = h_{top}(G')$, it follows from exercise 8.4.15 in Horn
and Johnson 1985 that $R = 0$, i.e. $G' = G$.

The proof of Theorem C is a direct consequence of the following lemmata. 

Lemma 4. For any $G \in \mathcal{G}$ and for $\mu^G_0$ almost any $\underline{x} \in L(G)$

$$\lim_{n \to \infty} \frac{\log \mu^G_0([S_n(\underline{x})])}{n} = -h_{top}(G) .$$

Proof. The result follows directly from the Shannon-McMillan-Breiman Theorem. 

Lemma 5. Let $G$ and $G'$ be two elements of $\mathcal{G}$ with the same topological entropy. If
$L(G) \subset L(G')$, then $G = G'$.

Proof. Let $C$ be the cone of vectors in $\mathbb{R}^{|A|}$ with positive coordinates, where $|A|$ denotes the
cardinal of the lexicon $A$. We have that for any integer $n$

$$G^n C \subset C \quad \text{and} \quad G'^n C \subset C .$$

We will denote by $\rho$ (or $\rho_G$) the eigenvector of $G$ corresponding to the largest positive
eigenvalue $\lambda = e^{h_{top}(G)}$, normalized by the condition

$$\sum_{i \in A} \rho_i = 1 .$$

We also denote by $\lambda'$ the corresponding eigenvalue of $G'$, and our hypothesis implies
that $\lambda = \lambda'$. We will derive a contradiction from the assertion that

$$L(G) \ne L(G') .$$

Since however $L(G) \subset L(G')$, we have $G' = G + B$ where $B$ is a matrix of dimension $|A|$
with entries equal to zero or one and with at least one non zero entry. We observe that
due to the irreducible and aperiodic property, there is an integer $m$ such that for any pair
of indices $(i,j)$

$$(G^m)_{i,j} > 0 .$$






We now fix $n = m + 1$, and conclude that for the order among vectors of $\mathbb{R}^{|A|}$ associated to
the cone $C$

$$G'^{m+1} \rho \ge G^{m+1} \rho + G^m B \rho .$$

We recall that $v_1 \ge v_2$ iff $v_1 - v_2 \in C$ (see Bowen 1975). From our above choice of $m$,
since $B$ has at least one non zero entry and since the components of $\rho$ are strictly positive,
we conclude that there is a strictly positive number $\eta$ such that

$$G'^{m+1} \rho \ge (\lambda^{m+1} + \eta) \rho .$$

Since $G'$ maps the cone $C$ into itself we have for any integer $k$

$$G'^{k(m+1)} \rho \ge (\lambda^{m+1} + \eta)^k \rho ,$$

which implies $\lambda' > \lambda$, a contradiction.

Lemma 6. There is a positive real number $r$ such that if $\phi \in O_r$, the map $G \to h(\mu^G_\phi)$ is
strictly monotone increasing.

Proof. The result follows from the continuity of $h(\mu^G_\phi)$ with respect to $\phi$ for $G \in \mathcal{G}$ (cf.
Parry and Pollicott 1990) and Lemma 5.

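Lemma 7. There is a positive real number $r$ such that for any $\phi \in O_r$ and for any $G \in \mathcal{G}$,

$$\lim_{n \to \infty} \mu^G_\phi(\{\underline{x} : M^\phi_n(\underline{x}) = \mathcal{E}^\phi_n(\underline{x}) = \{G\}\}) = 1 .$$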

Proof. The result follows from Lemma 6. 

Theorem C now follows from Theorem A and Lemma 7. 

5. PROOF OF THEOREM D

The proof of Theorem D follows from Lemma 8. 

Lemma 8. Given two grammars $G$ and $G'$ in $\mathcal{G}$ with $G < G'$, there exists a Hölder
continuous potential $\phi$ such that

$$h(\mu^{G'}_\phi) < h(\mu^G_\phi) .$$

Proof. The idea of the proof is to find a potential such that the Gibbs state will look like 
the invariant measure supported by a periodic orbit (which has of course zero entropy). 
As in the previous theorem, the matrix G' has at least one more entry equal to 1 than 
the matrix G. Therefore we can find a periodic orbit for the subshift of G' which is not 
admissible for $G$. Let $(y_0, \ldots, y_{q-1})$ be such a periodic orbit with minimal period $q$. In
other words, we have

$$G'(y_j, y_{j+1}) = 1 \ \text{for } 0 \le j < q-1 , \qquad G'(y_{q-1}, y_0) = 1 ,$$

and we can also assume that 

We now define the function $\phi$ as follows. Let $E$ be a positive number to be fixed later on;
we set $\phi(y_0, \ldots, y_q) = E$ if $y_q = y_0$ and there is an integer $0 \le l \le q-1$ such that for any
integer $0 \le j \le q-1$, $y_j = y_{j+l \pmod q}$, and $\phi(y_0, \ldots, y_q) = 0$ otherwise. Note that since
$\phi$ is a cylindrical function, it is Hölder continuous.

On the set $L(G)$, which is the phase space of the subshift associated to $G$, the function
$\phi$ is equal to 0. In this case, the corresponding Gibbs measure has maximal Kolmogorov-Sinai
entropy, i.e. $h(\mu^G_\phi) = h_{top}(G)$.

We now consider the Gibbs state on $L(G')$. We observe that for a fixed positive
number $\beta$, the transfer operator $\mathcal{L}^{G'}_{\beta\phi}$ associated to $\beta\phi$ on the subshift of $G'$ and given by

$$\mathcal{L}^{G'}_{\beta\phi} \psi(x_1, \ldots) = \sum_{x_0 \in A,\ G'(x_0,x_1)=1} e^{\beta\phi(x_0, x_1, \ldots)} \psi(x_0, \ldots) ,$$

maps the space of cylindrical functions of the first $q$ variables into itself (we will of course
only consider admissible sets of $q$ variables with respect to the matrix $G'$). This is a
finite dimensional subspace of the space of Hölder continuous functions, and if we find in
this subspace a positive eigenvalue with a strictly positive eigenvector, by uniqueness this
eigenvalue must be the exponential of the pressure of $\beta\phi$.

From now on it will be more convenient to use a matrix notation. Let $m$ denote the
number of sequences of length $q$ admissible by $G'$. The real valued cylindrical functions
which depend only on the first $q$ symbols form a real vector space of dimension $m$. It is
easy to verify that in this space the transfer operator can be represented by a matrix
which takes the following form

$$M_\beta = z^{-1}(M_0 + z M_1)$$






where $z = \exp(-\beta E)$, $M_0$ and $M_1$ are matrices with entries equal to 0 or 1. Moreover,
there is a basis of the space $\mathbb{R}^m$ denoted by $e_0, \ldots, e_{m-1}$ such that

$$M_0 e_j = e_{j+1 \pmod q} \ \text{for } 0 \le j \le q-1 ,$$
$$M_0 e_j = 0 \ \text{else} .$$

Note that $M_0$ is not primitive, but its spectrum is composed of the eigenvalue 0 with
multiplicity $m - q$ and the $q$-th roots of unity, which are simple eigenvalues. In particular, 1
is a simple eigenvalue. By analytic perturbation theory (see Parry and Pollicott 1990), we
conclude that for $z$ small enough, the matrix $z M_\beta$ has a simple eigenvalue $\lambda(z)$ which is
an analytic function of $z$ and tends to 1 as $z \to 0$. We also know that the matrix $z M_\beta$
is such that all its entries are non negative, and moreover there is a power of this matrix
with all its entries strictly positive. This matrix has therefore a real positive eigenvalue
which is simple and is also the unique point in the spectrum with maximum modulus. The
associated eigenvector has strictly positive coordinates. For $z$ small enough we conclude
that this point must be $\lambda(z)$. If we denote by $P(z)$ the function $\log \lambda(z)$ we have



and this number tends to zero if $E$ diverges. Therefore, for $E$ large enough, it will be
smaller than $h_{top}(G)$ and the theorem is proven.